INTELLIGIBILITY OF ICAO SPELLING ALPHABET WORDS AND DIGITS
USING SEVERELY DEGRADED SPEECH COMMUNICATION SYSTEMS


PART  I:  NARROWBAND  DIGITAL  SPEECH 


BACKGROUND 

There are a number of operational contexts in which voice communication is extremely important,
but voice quality can be expected to be moderately to severely degraded. An example is when
there is considerable interference, as in a jamming environment. Very low data-rate digital voice
systems (800 bit/s and less) are being developed to be more robust under jamming conditions; such
systems can also be expected to have somewhat lower intelligibility scores than existing systems at
higher data rates.

The evaluation of voice communication systems has been concerned primarily with the
intelligibility (and rated acceptability) of the speech signal. Large numbers of intelligibility and quality
tests have been carried out with the goal of selecting the best possible system at a given data rate.
The Digital Voice Processor Consortium has used the Diagnostic Rhyme Test (DRT) (Voiers, 1977a)
and the Diagnostic Acceptability Measure (DAM) (Voiers, 1977b) to evaluate a variety of narrowband
and broadband digital voice systems (e.g., Sandy, 1984). These tests are also very useful for
evaluating improvements to existing systems (e.g., the LPC Improvement Program conducted at NRL by
George Kang). Because of its high reliability and stability over time, the DRT has proved to be a
very useful tool for comparing the intelligibility of a variety of speech systems. The Digital Voice
Processor Consortium suggests a set of descriptors for various ranges of DRT scores (Table 1). In
this context, a DRT score below 70 is generally considered to be “unacceptable,” and this is a useful
criterion for selecting or developing voice systems.


Table 1 - Descriptors for Interpreting DRT Scores

DRT Score     Descriptive Label
---------     -----------------
100-95        EXCELLENT
95-90         VERY GOOD
90-85         GOOD
85-80         FAIR
80-75         POOR
75-70         VERY POOR
below 70      UNACCEPTABLE
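
The descriptor bands of Table 1 can be expressed as a small lookup. The following is a minimal sketch, not part of the original report: the function name `drt_label` is hypothetical, and because the table lists shared boundaries (e.g., 95 appears in two rows), the sketch assumes that each lower bound is inclusive.

```python
def drt_label(score):
    """Return the Table 1 descriptor for a DRT score between 0 and 100.

    Assumption: band lower bounds are inclusive, so a score of exactly 95
    is labeled EXCELLENT. The report does not say how ties are resolved.
    """
    if not 0 <= score <= 100:
        raise ValueError("DRT score must be between 0 and 100")
    # (inclusive lower bound, descriptive label), highest band first
    bands = [
        (95, "EXCELLENT"),
        (90, "VERY GOOD"),
        (85, "GOOD"),
        (80, "FAIR"),
        (75, "POOR"),
        (70, "VERY POOR"),
    ]
    for lower, label in bands:
        if score >= lower:
            return label
    return "UNACCEPTABLE"
```

For example, `drt_label(75.4)` gives "POOR" and `drt_label(65.6)` gives "UNACCEPTABLE".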
the quality of the communication link; the communicator must work with the existing system even
though it might be judged to be “unacceptable” in the laboratory. Standard DRT intelligibility test
scores are not at present very informative about user performance in these very poor voice conditions.
Even though providing a set of descriptors does help somewhat, many would-be users have no
reference frame for interpreting DRT scores in terms of performance measures that they can understand,
e.g., how many operational words are correctly understood. Field tests are not a very satisfactory
solution because they are very expensive to conduct, and they are also highly context-dependent so
that the results tend to vary with changing conditions, making it difficult to generalize any conclusions
to other situations.

This research was aimed at providing a better understanding of the effects of poor quality speech
on human communication performance. It has been observed that in real-world contexts, people often
find a voice system to be acceptable even though it may sound fairly unintelligible to a naive observer
who is accustomed to telephone quality. For example, communications between a ground station and
a military aircraft may be successful even though the DRT score for the combination of the voice
system and aircraft noise is low. When communications occur in the real world, the listener has two
kinds of information available to him to help him understand the message: speech information and
situational information. Intelligibility tests like the DRT are designed to measure only the speech
information. When the speech signal is degraded, the situational information can make it possible to
understand an otherwise unintelligible message. In the ground-station-to-aircraft situation, there is a
rigid protocol for communication that is known to the users. Furthermore, they use a limited and
highly specialized vocabulary consisting of words and phrases that are designed to be easily
distinguished. The type of information that is likely to be communicated is also dependent on the nature
of the mission and the stage in the sequence of the mission. At any given time, there are only a few
alternative messages that are likely to be expected in that context.

As suggested by the previous example, two contextual factors that affect intelligibility are the
size of the expected message set and the discriminability of the different messages within the set.
Both of these effects are illustrated in an experiment by Miller, Heise, and Lichten (1951). They
conducted intelligibility tests using varying numbers of monosyllabic words as the test alternatives;
also included were the digits zero to “niner.” Figure 1 shows the scores for some of the conditions
in this experiment, plotted from the Miller, Heise, and Lichten data after applying a correction for
guessing. As the number of response alternatives decreased, a higher percentage of correct responses
was obtained, even under severe noise conditions. When the response choices were the ten digits, a
set that includes words with varying numbers of syllables, the intelligibility scores were higher than
for the set of eight monosyllables and nearly as good as when there were only two choices. A
vocabulary consisting of words that vary in the number of syllables and in their stressed vowels is easier to
recognize than a set of monosyllables of the same size, especially if the phonetic content of the
monosyllables is also similar. Under high noise or otherwise degraded conditions, the vowels, especially the
stressed vowels, and the syllable patterns of the words become the most important cues for word
recognition. The consonants have less energy and are more easily masked by noise interference.

The International Civil Aviation Organization (ICAO) spelling alphabet was developed (Moser,
1959; Moser and Dreher, 1955) so that the words would be highly discriminable from one another,
and these words can be used where ordinary language would not be easily understood. Because it is
widely used and represents a small but distinctive vocabulary, the ICAO alphabet and the names of
the digits for zero to “niner” were chosen for the experiments that are described here. Results using
these words can be expected to be similar to those that would be obtained with other small,
specialized communication vocabularies.

Two  types  of  speech  degradation  were  selected  for  investigation:  narrowband  digital  speech 
with  varying  bit  error  rates,  and  varying  degrees  of  analog  radio  jamming.  The  report  will  be  in  two 
parts.  Part  1  covers  the  narrowband  digital  speech  research,  and  Part  2  will  cover  the  analog  jammed 
speech  research. 
Fig. 1 - Intelligibility scores vary as a function of the size of the response set and the
distinctiveness of the responses within the set. (Data adapted from Miller, Heise, and Lichten,
1951.) Corrections for guessing were made before plotting the data. No curve has been drawn
for the digits data, which lie very close to the two-choice data. Digits are more distinctive than
monosyllables.


INTRODUCTION 

For narrowband secure voice communications, a linear predictive coding (LPC) algorithm
operating at 2400 bits/s (Tremain, 1982) has been established as the DoD standard (Federal Standard
1015 or MIL-STD-188-113), and has also been adopted by NATO (STANAG 4198). This algorithm
is being incorporated in the Navy's Advanced Narrowband Secure Voice Terminal (ANDVT) as well
as the new Subscriber Terminal Unit (STU III) voice terminals in production. The Digital Voice
Processor Consortium conducted DRT tests of the LPC processor with random bit error rates up to 5%
(Sandy, 1984). At the 5% error rate, the DRT score for the LPC processor was 75, which is in the
poor to very poor range. Since the purpose here was to investigate performance under severely
degraded voice conditions, bit error rates up to 12% were tested. In addition, a very low data rate
algorithm operating at 800 bits/s was included. The 800 bit/s algorithm is based on the standard LPC and
employs pattern matching (Fransen, 1983). This algorithm is being developed for narrowband
antijam applications and could be used at the 2400 bit/s rate with the remaining bits as error
protection (Kang and Jewett, 1986).

METHOD 


For  the  voice  conditions  selected,  two  sets  of  scores  were  obtained.  DRT  tapes  were  processed 
and  sent  to  Dynastat,  Inc.  for  scoring.  Listener  tests  using  the  spelling  alphabet  words  and  the  names 
of  the  digits  were  also  conducted.  The  listeners  heard  random  groups  of  words  and  they  wrote  down 
the  letter  or  digit  corresponding  to  each  word. 

Speech  Materials 

For the DRT tests, standard DRT tapes for three male and three female speakers were used.
For the alphabet tests, the 26 ICAO spelling alphabet words (ALFA, BRAVO, CHARLIE, DELTA,
etc.) and the names of the digits (zero through “niner”) were used. There were five speakers: three
males and two females. Each speaker first read the 36 words in alphabetical and numerical order and
then read four different randomized lists of the words. The randomizations were read in four groups
of nine words each with a brief pause after each group. These lists were duplicated, and the groups
of nine test words for the five different speakers were combined into eight test tapes. (To obtain
eight tapes, each of the four readings for each speaker was used twice.) Each test tape was
constructed so that it consisted of 180 words in 20 groups of nine, making up a complete set of letters and
numbers for each speaker, but the word groups for the different speakers were intermingled so that
the listeners would not be able to guess what words might come next based on what they had heard
before. The order of the speakers was balanced across the eight test tapes but appeared random
within a tape. Each tape lasted 5.25 min.
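
The tape layout described above (five speakers, 36 words each, read in groups of nine and intermingled) can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_tape` is hypothetical, the digit spellings other than TREE, FIFE, and NINER are assumed, and a plain shuffle stands in for the balanced speaker ordering actually used, which the report does not specify in detail.

```python
import random

LETTERS = [
    "ALFA", "BRAVO", "CHARLIE", "DELTA", "ECHO", "FOXTROT", "GOLF",
    "HOTEL", "INDIA", "JULIETT", "KILO", "LIMA", "MIKE", "NOVEMBER",
    "OSCAR", "PAPA", "QUEBEC", "ROMEO", "SIERRA", "TANGO", "UNIFORM",
    "VICTOR", "WHISKEY", "X-RAY", "YANKEE", "ZULU",
]
# Assumption: exact digit spellings; the report attests TREE, FIFE, NINER.
DIGITS = ["ZERO", "ONE", "TWO", "TREE", "FOUR", "FIFE",
          "SIX", "SEVEN", "EIGHT", "NINER"]
VOCAB = LETTERS + DIGITS  # 36 words

def make_tape(speakers, rng):
    """Return a list of (speaker, nine-word group) pairs for one test tape."""
    groups = []
    for speaker in speakers:
        words = VOCAB[:]
        rng.shuffle(words)                 # one randomized reading of all 36 words
        for start in range(0, len(words), 9):
            groups.append((speaker, words[start:start + 9]))
    rng.shuffle(groups)                    # intermingle the speakers' groups
    return groups

tape = make_tape(["DC", "HM", "CT", "VV", "AS"], random.Random(0))
```

With five speakers this yields 20 groups of nine words, or 180 words per tape, matching the counts given in the text.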

Voice  Conditions 

There were six digitally processed voice conditions as well as the unprocessed, high-quality
speech. There were five LPC conditions at bit error rates of 0%, 2%, 5%, 8%, and 12%. The LPC
tapes for both the DRT and spelling alphabet tests were generated by processing the tape recorded
materials through a low data-rate voice terminal built by TRW. This terminal does not incorporate
the LPC enhancements (Kang and Everett, 1982; 1984) that will be included in the ANDVT and STU
III equipment, so the newer devices can be expected to perform somewhat better. Random bit errors
for the various error conditions were introduced into the bit stream between the analysis and synthesis
portions of the processing. The last digital condition was the 800 bit/s pattern matching algorithm
based on the standard 2400 bit/s LPC, which was tested for comparison with the LPC with errors.
The 800 bit/s speech was obtained using the same TRW terminals, set to operate at the 800 bit/s rate.
More recent versions of this algorithm are improved but were not available in real time when these
experiments were conducted.
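
The random bit-error conditions can be illustrated with a short sketch that flips each bit of a stream independently with a given probability. This is an assumption about how such errors are commonly simulated, not a description of the TRW terminal's error generator; the function name `inject_bit_errors` is hypothetical.

```python
import random

def inject_bit_errors(bits, error_rate, rng=None):
    """Flip each bit (0 or 1) independently with probability error_rate.

    Assumption: errors are independent and uniformly distributed over the
    stream, which is the usual meaning of "random bit errors".
    """
    rng = rng or random.Random()
    return [bit ^ 1 if rng.random() < error_rate else bit for bit in bits]

# One second of a 2400 bit/s stream at the 12% condition
clean = [0] * 2400
noisy = inject_bit_errors(clean, 0.12, random.Random(1))
```

At 2400 bits/s, a 12% error rate corresponds to about 288 corrupted bits per second on average.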

Test  Procedure 

The testing sessions consisted of a familiarization procedure followed by a practice test, after
which the eight test tapes were presented in two groups of four with a 10-15 minute rest between the
two halves. For the familiarization, the listeners were told how the spelling alphabet is used and that
each letter is represented by a word beginning with the corresponding letter. They then heard each of
the five speakers reading the list of spelling alphabet words and the digits in alphabetical and
numerical order. For the practice test, they heard a moderately difficult voice condition (LPC speech with
2% bit errors). They were given answer sheets with blanks printed on them in groups of nine and
were told to write down the letter or number corresponding to each word that they heard. After the
practice, listeners were permitted to check their answers against a list of the correct words and were
reminded again of the correspondence between the words and the letters. Some listeners reported that
they sometimes had a tendency to substitute letters for numbers (especially where the digit words
were slightly different from the everyday pronunciation), and they were asked to be extra careful
about this. The test tapes were presented in the same manner as the practice tape with similar answer
sheets. The first tape in each group of four was unprocessed speech, so the unprocessed version was
heard twice. The six digitally processed speech conditions were presented in different
counterbalanced orders for different groups of subjects so that any possible effects of practice or fatigue would
be balanced across test conditions. A total of 56 listeners were tested; however, the data from four
had to be discarded because the experimenter used the wrong tapes, so there were 52 complete sets of
listener data. The subjects were tested in groups of one to four, and the test tapes were played over
high-quality stereo headphones using a Nagra IIS tape recorder.

Scoring  Procedure 

The listeners' responses were typed into a computer for scoring and sorting. The responses
were recorded as nearly as possible exactly as written by the listeners. In spite of instructions to
write distinctly and to distinguish between digits and letters that might be confused (2-Z, 5-S, 1-I,
zero-O), some of the responses were ambiguous. The person who typed in the responses had a list of
the correct responses, and when it seemed obvious which response was intended, this was used. For
example, FIFE and SIERRA do not sound at all alike, so when it was not clear whether “S” or “5” was
written, the correct answer would be assumed; however, where a letter was written for a number
(e.g., F for FIFE), the actual response was recorded, even though one might guess that the word was
actually heard correctly. The reader who wishes to make adjustments for this type of error, which
probably slightly inflates the overall error rates, can consult the confusion matrices given in the
appendix of this report to find the number of times “T” was given for TREE, “N” for NINER, and
so on. In the highest error conditions, many responses were skipped. In a few cases where it was
obvious that a series of several correct responses was merely displaced by one line, the correct
responses were recorded, but a single response with blanks before and after it was always recorded in
its given position, even if it seemed to be displaced, to avoid guessing errors.
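
The tallying step implied by this procedure (correct responses, confusions, and omissions) can be sketched as below. This is a hypothetical illustration, not the scoring program used in the study: the function name `score_responses` and the data layout (one symbol per word, an empty string for a skipped response) are assumptions.

```python
from collections import Counter

def score_responses(spoken, responses):
    """Tally one answer sheet.

    spoken    -- list of correct symbols, e.g. "P" for PAPA, "5" for FIFE
    responses -- list of symbols as written by the listener ("" if skipped)
    Returns (percent correct, confusion counts, omission counts).
    """
    confusions = Counter()   # keyed by (spoken symbol, written symbol)
    omissions = Counter()    # keyed by spoken symbol
    correct = 0
    for target, written in zip(spoken, responses):
        written = written.strip().upper()
        if written == "":
            omissions[target] += 1
        elif written == target:
            correct += 1
        else:
            # A letter written for a digit (F for FIFE) is kept as an error,
            # following the rule described in the text.
            confusions[(target, written)] += 1
    return 100.0 * correct / len(spoken), confusions, omissions

pct, conf, omit = score_responses(["P", "5", "E", "A"], ["A", "F", "", "A"])
```

In this small example one of four responses is correct, PAPA is recorded as heard as ALFA, "F" is recorded as an error for FIFE, and ECHO is counted as an omission.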

RESULTS 

Table  2  shows  the  percentage  of  correct  responses  on  the  alphabet  word  task  and  on  the  DRT 
test  for  the  various  speech  conditions.  As  expected,  the  highly  discriminable  spelling  alphabet  words 
were  more  intelligible  under  the  degraded  digital  voice  conditions  than  were  the  rhyming  DRT  words. 
However,  at  the  highest  bit  error  rates,  the  intelligibility  of  the  alphabet  words  fell  off  more  rapidly. 
At  the  12%  error  rate,  only  slightly  more  than  one-half  the  words  were  correctly  identified,  and  the 
alphabet  word  score  was  nearly  as  low  as  the  DRT  score.  The  Digital  Voice  Processor  Consortium 
reported  on  tests  of  the  DoD  standard  LPC  processor  (Sandy,  1984)  with  random  bit  errors  at  0%, 
0.5%,  1%,  2%,  and  5%.  For  the  conditions  that  were  the  same,  the  scores  reported  here  are  very 
similar  to  the  consortium  results,  although  the  present  scores  were  about  one  percentage  point  lower 
than  those  obtained  in  the  consortium  tests.  This  difference  can  probably  be  attributed  to  minor 
differences  in  the  implementation  of  the  LPC  algorithm  in  the  voice  processors  that  were  used  for  the 
different  tests. 


Table 2 - ICAO Spelling Alphabet Word Scores and DRT Scores
for the Degraded Digital Speech Conditions

Speech Condition      DRT Score     ICAO Alphabet Score
----------------      ---------     -------------------
Unprocessed           97.6          99.0
LPC 0% errors         86.0          98.0
LPC 2% errors         81.9          96.2
LPC 5% errors         75.4          91.3
LPC 8% errors         65.6          80.0
LPC 12% errors        52.3          54.1
800 bit/s system      77.4          95.4


Figure 2 shows the effect of bit errors on the ICAO spelling alphabet words and on DRT scores,
and Fig. 3 shows the relationship between DRT scores and alphabet word scores. The curves that are
shown are the best fitting line for DRT scores and the best fitting second-order polynomial for the
alphabet words; a third-order polynomial* can be used to describe the relationship between DRT
scores and spelling alphabet word scores within the range tested here. The prediction equations are:

DRT from bit error rate:          Y = 87.474 - 2.821 X

ICAO words from bit error rate:   Y = 97.288 + 0.912 X - 0.363 X^2

ICAO words from DRT:              Y = -296.2 + 11.82 X - 0.118 X^2 + 0.000393 X^3

where X is the predictor (bit error rate in percent, or DRT score) and Y is the predicted score.


*A second-order polynomial “fits” the data just as well, but would predict a drop in the intelligibility of spelling alphabet
words for the highest DRT scores.
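
The linear DRT equation can be checked against the Table 2 values with an ordinary least-squares fit. The sketch below assumes the fit was made over the five LPC bit-error conditions only (0% to 12%); the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# LPC conditions from Table 2
bit_error_rate = [0, 2, 5, 8, 12]            # percent
drt_score = [86.0, 81.9, 75.4, 65.6, 52.3]

intercept, slope = fit_line(bit_error_rate, drt_score)
# intercept is about 87.474 and slope about -2.821, as in the DRT equation
```

Under that assumption the fitted coefficients agree with the published DRT equation to three decimal places.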
Fig. 2 - DRT scores and scores for the ICAO spelling alphabet and digits for the
standard LPC algorithm at 2400 bits/s as a function of random bit error rate


Fig. 3 - Scores for the ICAO spelling alphabet and digits as a
function of DRT score. Scores should not be extrapolated beyond
the range of the data reported here.


These  equations  should  be  considered  to  be  purely  descriptive.  No  underlying  theoretical  relationship 
is  implied,  and  the  equations  should  not  be  used  for  extrapolation.  This  is  not  a  serious  limitation 
since  intelligibility  cannot  exceed  100%,  and  a  voice  system  for  which  only  about  one-half  of  the 
words  can  be  understood  is  for  practical  purposes  unusable. 
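
The three prediction equations can be written as functions that refuse to extrapolate. This is a sketch only: the function names are hypothetical, and the allowed ranges (0% to 12% bit errors, DRT scores from 52.3 to 97.6) are assumptions taken from the span of the values in Table 2.

```python
BER_RANGE = (0.0, 12.0)    # percent bit errors tested (assumed valid range)
DRT_RANGE = (52.3, 97.6)   # lowest and highest DRT scores in Table 2 (assumed)

def _check(value, bounds, name):
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside the tested range {bounds}")

def drt_from_ber(ber):
    """Predicted DRT score from percent bit error rate."""
    _check(ber, BER_RANGE, "ber")
    return 87.474 - 2.821 * ber

def icao_from_ber(ber):
    """Predicted ICAO word score (percent correct) from percent bit error rate."""
    _check(ber, BER_RANGE, "ber")
    return 97.288 + 0.912 * ber - 0.363 * ber ** 2

def icao_from_drt(drt):
    """Predicted ICAO word score (percent correct) from DRT score."""
    _check(drt, DRT_RANGE, "drt")
    return -296.2 + 11.82 * drt - 0.118 * drt ** 2 + 0.000393 * drt ** 3
```

For instance, `icao_from_drt(86.0)` gives about 97.6, close to the observed 98.0 for LPC with no bit errors.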

The 36-word vocabulary consisting of the spelling alphabet words and the ten digits behaved
similarly to the small response set tests in the Miller et al. (1951) experiments, whereas the DRT,
which has two highly similar rhyming response alternatives for each item, behaved very similarly to
the large response set tests. By manipulating the discriminability of the items, the usual effect of size
of the response set can be reversed. These data still conform to the general principle that easy
discriminations, whether due to restricting the response set or to increasing the discriminability of the
responses, remain intelligible longer than difficult ones. This experiment as well as the results of
Miller et al. suggest that the easiest discriminations are highly intelligible under increasingly severe
degradations up to the point where it becomes practically impossible to understand any speech at all,
and then intelligibility rapidly degenerates. More difficult discriminations begin to show some losses
even under mild degradations and tend to fall off gradually with increasing speech degradation. The
Consortium test results (for the range from 0% to 5% bit errors) also showed a linear relationship
between bit error rate and DRT scores. Moser and Dreher (1955), in reporting on the development of
the ICAO alphabet, indicated a curvilinear relationship between S/N and the intelligibility of spelling
alphabet words. However, the difference in difficulty between the rhyme test task and the word task
found here differs from results reported by Webster (1972), who states that “. . . Rhyme test scores
are generally equivalent to scores on Brevity Code words (including digits and ICAO phonetic
spelling words).” These different results may be due to differences in the test methods that were used.
For making comparisons between different intelligibility tasks, it is important to understand the task
variables that contribute to the ease or difficulty of the discrimination. With small sets of response
alternatives, it is also essential to make the appropriate corrections for guessing. Voiers (1983)
showed that the DRT and the Modified Rhyme Test (MRT) (House et al., 1965) are highly correlated,
and the scores tend to be similar when the appropriate corrections for guessing are made. Without
such corrections, the lowest expected score for a two-choice test would be 50%, but it would be 20% for a
five-choice test. Clearly, with large numbers of choices, the effect of guessing becomes
insignificant. Figure 3 relates DRT scores specifically to spelling alphabet words and digits. A similar
relationship can only be expected for other pairs of tasks if the task characteristics are similar (i.e.,
other fixed choice rhyme tests such as the MRT and other small, distinctive vocabularies).
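
The effect of a correction for guessing can be illustrated with the standard chance correction, which rescales a raw score so that pure guessing maps to 0 and perfect performance to 100. The report does not state the exact formula it used, so the form below is an assumption, and the function name `correct_for_guessing` is hypothetical.

```python
def correct_for_guessing(percent_correct, n_choices):
    """Rescale a raw percent-correct score so that chance performance is 0.

    Assumption: the standard correction
        corrected = 100 * (raw - chance) / (100 - chance),
    where chance = 100 / n_choices.
    """
    chance = 100.0 / n_choices
    return 100.0 * (percent_correct - chance) / (100.0 - chance)
```

With this form a raw score of 50% on a two-choice test and a raw score of 20% on a five-choice test both correct to 0, while for a 36-word response set the chance level is under 3% and the correction changes the score very little.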

It can be seen from Fig. 3 that for DRT scores down to about 75, over 90% of spelling alphabet
words are correctly recognized, and at a DRT score of 65, more than 80% of the words are still
recognized correctly. Experienced communicators who are highly familiar with the vocabulary can be
expected to perform even better than these tests indicate. In constrained contexts with standard
phrases and a distinctive vocabulary, communication can be good even when the DRT score for the
voice system is as low as 75, a score that would be considered to be “poor” to “very poor.” Even
with a system that has a DRT score below the “unacceptable” level, some communication should be
possible, but it would be essential to confirm all information, and some errors as well as slower
communication might be expected. When the DRT score falls below 65, communication can be expected
to be highly unreliable at best.

Speaker  Differences 

The speakers used in these tests were to some extent selected for their different voice
characteristics. Of the male speakers, DC was known from past experience to have a voice that performed
particularly well with the LPC processor, whereas HM had a voice known to be difficult over LPC,
and CT had a voice at the higher end of the pitch range for male voices. Of the two female voices,
VV had a higher pitch than AS. (LPC tends to perform more poorly with higher pitched voices.) As
was to be expected, there were large individual differences among speakers in the number of alphabet
words that were recognized, and speaker differences increased with increasing bit-error rates. Figure
4 shows individual speaker scores for the bit-error conditions. This gives some suggestion of the
range of scores that can be expected with different speakers. If the scores for HM (the known poor
speaker over LPC) are excluded, the usual finding that female voices perform more poorly over LPC
than male voices is also confirmed. Since the speakers for the alphabet word tests were not the
standard DRT speakers, it is difficult to make direct comparisons with the DRT results. There were
also speaker differences and sex differences in the DRT scores. Figure 5 shows the expected
difference between male and female speakers for the DRT scores.

Word  and  Phoneme  Confusions 

The words that are most likely to be confused with one another under the digital degradations
used here are quite different from the confusions that were found with noise degradation when
the research to select the currently used ICAO alphabet was being conducted (Moser and Dreher,
1955; Moser, 1959). Full confusion matrices for the spelling alphabet words and the names of the
digits are given in the appendix for each of the conditions tested. Since the confusions tended to be
similar across bit-error conditions, changing primarily in number, a confusion matrix combining the
data for the three highest bit-error conditions (5%, 8%, and 12%) is included as a summary of the
most frequent confusions. Table 3 lists the most frequently confused words, those that were most
often omitted, and the words with the most total errors (confusions and omissions).


Fig. 4 - Speaker differences for the ICAO spelling alphabet and digits for the
standard LPC algorithm as a function of random bit error rate. Speaker differences
increase as the speech becomes more degraded.


Fig. 5 - DRT scores for male and female speakers for the standard LPC algorithm at
2400 bits/s as a function of random bit error rate. Scores for female speakers are
lower and decrease more rapidly than for male speakers.


For  comparison  with  the  types  of  confusions  that  are  made  in  noise,  some  of  the  most  frequent 
confusions  and  omissions  from  Moser  (1959)  are  shown  in  Table  4.  Since  different  combinations  of 
words  were  being  tested,  and  all  of  the  tests  that  were  reported  by  Moser  (1959)  have  some  words 
Table 3 - Most Frequently Confused Words, Most Frequently Omitted Words,
and Words with the Most Total Errors for LPC Speech with High Bit-Error Rates
(Averaged over the 5%, 8%, and 12% Conditions)

          CONFUSIONS                        OMISSIONS               TOTAL ERRORS
Spoken    Heard as    Percent of     Word      Percent of     Word      Percent of
                      responses                responses                responses
------    --------    ----------     ------    ----------     ------    ----------
PAPA      ALFA        21.9           ECHO      26.6           PAPA      56.0
DELTA     ALFA        14.2           GOLF      25.6           ECHO      44.3
INDIA     JULIETT     10.4           MIKE      20.3           MIKE      42.8
TWO       ZERO         8.5           QUEBEC    19.4           GOLF      42.1
KILO      ZERO         6.9           BRAVO     19.2           EIGHT     32.8
EIGHT     SIX          6.9           YANKEE    17.4           BRAVO     32.6
MIKE      FIFE         6.7           TANGO     17.1           TANGO     32.3
ZERO      ZULU         5.6           EIGHT     15.8           INDIA     31.5
OSCAR     FOXTROT      5.1           VICTOR    14.9           QUEBEC    31.5
ALFA      DELTA        5.0           HOTEL     14.4           DELTA     30.0


Table 4 - Some Frequent Confusions and Omissions Obtained by Moser (1959)
in Tests of ICAO Words in Noise

     CONFUSIONS            OMISSIONS
Spoken      Heard
-------     -------        ---------
FOXTROT     OSCAR          ECHO
ECHO        X-RAY          GOLF
X-RAY       OSCAR          JULIETT
ZULU        JULIETT        QUEBEC
JULIETT     ZULU           TANGO
ECHO        HOTEL          VICTOR
HOTEL       ECHO           BRAVO

that  are  different  from  those  n  the  present  ICAO  alphabet  (e.g.,  FOOTBALL,  NECTAR,  and 
ZEBRA  in  one  version),  it  is  difficult  to  make  precise  comparisons  between  their  results  and  the 
present  data  Only  words  that  do  overlap  with  the  present  ICAO  alphabet  are  included  in  Table  3. 
Total  errors  are  not  shown  because  the  composition  of  the  list  as  a  whole  influences  the  nature  of  the 
error  patterns  Although  the  words  that  were  confused  with  one  another  in  noise  are  quite  different 
from  the  confusions  with  the  digital  degradations,  the  words  that  were  omitted  (i.e.,  not  understood  at 
all)  were  quite  similar  in  the  two  cases  The  most  frequent  confusion  for  the  degraded  digital  speech 
was  PAPA  ALFA,  and  these  words  were  almost  never  confused  in  the  Moser  studies.  The  reverse 
contusion  ALFA  PAPA  was  also  very  rare  for  the  digital  speech.  There  were  a  number  of  highly 
asymmetrical  contusions  for  the  digital  degradations,  whereas  the  general  pattern  of  confusions  tended 
to  be  considerably  more  symmetrical  for  noise  degradation.  This  suggests  that  the  LPC  degradations 
tend  to  eliminate  the  speech  cues  that  are  needed  to  make  certain  distinctions;  specifically,  the  abrupt 
onsets  that  cue  the  stop  consonants  ip.  t.  k.  b.  d.  g)  are  blurred  by  the  LPC  with  bit  errors,  and  the 
sounds  are  heard  as  continuants  instead  (f.  /.  th.  v.  etc  ).  This  effect  can  also  be  seen  in  the  frequent 
KILO  to  ZERO  confusion.  The  LPC  enhancements  (Kang  and  Everett.  1984)  include  a  window 
placement  strategy  that  greatly  improves  the  reproduction  of  these  sounds,  and  some  of  the  confusions 
reported  here  may  be  less  serious  for  the  newer  processors  that  include  the  LPC  enhancements.  The 
confusions  that  were  based  on  vowel  similarity  and  syllable  patterns  were  more  symmetrical,  for 
example  MIKE  and  FIFE  or  ALFA  and  DELTA,  but  even  these  showed  more  errors  in  the  direction 
of  substituting  sustained  sounds  for  stops. 
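
The asymmetry discussed above can be read directly off a confusion matrix by comparing the two directions of each word pair. The following Python sketch is one possible way to do this and is not the analysis used in the report. The function name asymmetry and the index it computes are assumptions introduced here, and the counts are invented placeholders, not values from the appendix matrices.

```python
def asymmetry(counts, a, b):
    """Compare the two directions of a confusion between words a and b.

    counts[spoken][heard] is the number of times `spoken` was reported
    as `heard`. Returns (n_ab, n_ba, index), where
    index = (n_ab - n_ba) / (n_ab + n_ba). An index of 0 means the
    confusion is symmetric; +1 or -1 means it occurs in one direction only.
    """
    n_ab = counts.get(a, {}).get(b, 0)
    n_ba = counts.get(b, {}).get(a, 0)
    total = n_ab + n_ba
    index = (n_ab - n_ba) / total if total else 0.0
    return n_ab, n_ba, index


# Placeholder counts for illustration only (NOT data from the report).
example_counts = {
    "PAPA": {"PAPA": 200, "ALFA": 30},
    "ALFA": {"ALFA": 250, "PAPA": 2},
    "MIKE": {"MIKE": 230, "FIFE": 12},
    "FIFE": {"FIFE": 235, "MIKE": 8},
}

for pair in (("PAPA", "ALFA"), ("MIKE", "FIFE")):
    print(pair, asymmetry(example_counts, *pair))
```

With real counts from the appendix, a pair such as PAPA and ALFA would be expected to give an index near 1, and a vowel-based pair such as MIKE and FIFE an index closer to 0.
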

Figure 6 shows the effect of bit errors on DRT feature scores. The graveness and sustention
features had the lowest overall scores in all conditions. Sustention is the feature that allows us to
distinguish between p and f, b and v, etc. Graveness allows us to distinguish between sounds like b and
d or f and th, among others. The nasality feature (m/b, n/d) scored high at the low bit-error rates but
fell off more than the others with increased bit-error rates. Voiers (1983) also found graveness and
sustention to be the most vulnerable under noise degradation, but nasality remained robust in noise.


[Figure 6 legend: Voicing, Nasality, Sustention, Sibilation, Graveness, Compactness]


Fig.  6  —  DRT  feature  scores  for  the  standard  LPC  algorithm  at  2400  bits/s  as  a  function  of 
random  bit  error  rate.  The  graveness  and  sustention  features  are  the  most  vulnerable  to 
degradation  under  these  conditions. 


CONCLUSIONS 

It has been frequently observed that different measures of speech intelligibility are highly
correlated with one another (e.g., Montague, 1960; Webster, 1972; Voiers, 1983). Thus, once the
relationship between two measures has been established, it becomes possible to predict one from the
other. In this case, the highly reliable and widely used DRT scores can be used to predict the
performance of digital voice systems using small, distinctive vocabularies such as the ICAO spelling
alphabet and the digits. The relationships shown in Fig. 3 can be used to interpret DRT scores for
other, similar digital voice systems. This should help to make the DRT scores more meaningful to users
of the voice systems. Small changes in DRT scores at the high end of the scale are most important for
ordinary conversational speech, where a varied vocabulary is to be expected, or for transmitting highly
specialized information where many words may sound similar. With distinctive military vocabularies,
word intelligibility can be expected to remain high even when DRT scores fall into the poor range, but
once the DRT scores fall below about 75, the intelligibility can be expected to fall off rapidly, and at
scores below 50, fewer than one-half of the words will be understood.

Given the negligible losses in spelling alphabet and digit intelligibility for the LPC system that
was used in the present tests, the ANDVT and STU III equipments incorporating the LPC enhancements,
with DRT scores that are several points higher, can be expected to be virtually error free
on this type of vocabulary. Similarly, the more recent 800 bit/s algorithm using line spectrum
frequencies (Kang and Fransen, 1985) also has a better DRT score than the system tested here, and over
96% of such words can be expected to be correctly understood, 98% with good speakers. Even
under relatively severe degradation due to random bit errors, the recognition of distinctive
vocabularies can be expected to remain quite good. The more degraded the voice system becomes, the
more important individual speaker characteristics will be. If conditions are expected to be severely
degraded, it becomes important to use communicators whose voices perform well over LPC systems.
At the 8% bit-error rate, the very good speaker, DC, scored 87% correct, whereas the poor speaker,
HM, scored only 73% in the same condition. Since LPC systems are expected to be in widespread
use, a useful research area will be an investigation of the voice traits that characterize “good” and
“poor” LPC speakers.

The  confusion  matrices  in  the  appendix  can  be  used  to  pinpoint  specific  word  pairs  that  are  the 
most  likely  to  lead  to  difficulties  at  high  bit-error  rates.  For  specific  situations  where  it  is  essential  to 
communicate  certain  information  under  very  poor  conditions,  it  may  be  advantageous  to  select  code 
words  or  names  from  those  items  that  are  the  least  likely  to  be  confused  or  missed. 
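
As an illustration of how the matrices might be used for this purpose, the following Python sketch ranks spoken words by the proportion of responses that were correct. It is a sketch under stated assumptions, not a procedure from the report. The function name rank_by_reliability and the "OMITTED" key are hypothetical, and the counts are invented placeholders, not appendix data.

```python
def rank_by_reliability(counts):
    """Return spoken words ordered from most to least reliably recognized.

    counts[spoken][heard] is a response count. Every response that is not
    the spoken word itself counts as an error, including responses under
    the hypothetical "OMITTED" key (word not understood at all).
    """
    scored = []
    for spoken, row in counts.items():
        total = sum(row.values())
        correct = row.get(spoken, 0)
        proportion = correct / total if total else 0.0
        scored.append((proportion, spoken))
    # Highest proportion correct first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored]


# Placeholder counts for illustration only (NOT data from the report).
# Each row sums to 260, the number of responses per spoken word per matrix.
placeholder_counts = {
    "PAPA": {"PAPA": 180, "ALFA": 60, "OMITTED": 20},
    "ALFA": {"ALFA": 250, "PAPA": 2, "OMITTED": 8},
    "KILO": {"KILO": 200, "ZERO": 45, "OMITTED": 15},
    "ZERO": {"ZERO": 255, "OMITTED": 5},
}

print(rank_by_reliability(placeholder_counts))
```

A fuller selection procedure would also check how often each candidate word appears as an incorrect response to other words, since a word that is rarely missed may still be a frequent wrong answer.
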

Appendix

CONFUSION  MATRICES 


These  matrices  are  for  the  ICAO  spelling  alphabet  words  and  the  digits  using  unprocessed 
speech  as  well  as  for  six  digital  speech  conditions. 

1.  Unprocessed  speech,  99.0%  correct. 

2.  LPC  with  0%  bit  errors,  98.0%  correct. 

3.  LPC  with  2%  bit  errors,  96.2%  correct. 

4.  LPC  with  5%  bit  errors,  91.3%  correct. 

5.  LPC  with  8%  bit  errors,  80.0%  correct. 

6.  LPC  with  12%  bit  errors,  54.1%  correct. 

7.  LPC  with  5%,  8%,  and  12%  bit  errors  combined. 

8.  800  bit/s  pattern  matching  algorithm,  95.4%  correct. 

The  total  number  of  possible  responses  for  each  spoken  word  was  260  for  each  matrix  except 
condition  7,  the  combined  matrix,  for  which  there  were  780  total  responses. 
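
No overall percentage is listed for the combined matrix (condition 7), but a rough figure follows from the numbers above. Because each of the three pooled conditions contributes the same 260 responses per spoken word, the pooled score is simply the mean of the three listed percentages. The Python sketch below shows the arithmetic; the function name is hypothetical, and the result is an estimate derived from the rounded percentages rather than a value reported in the source.

```python
RESPONSES_PER_WORD = 260  # per spoken word in each single-condition matrix

# Percent correct for the 5%, 8%, and 12% bit-error conditions, as listed above.
percent_correct = {5: 91.3, 8: 80.0, 12: 54.1}


def pooled_percent_correct(percent_by_condition, n_per_condition):
    """Pool conditions that each contribute n_per_condition responses.

    Returns (pooled percent correct, total number of pooled responses).
    """
    correct = sum(p / 100.0 * n_per_condition
                  for p in percent_by_condition.values())
    total = n_per_condition * len(percent_by_condition)
    return 100.0 * correct / total, total


pooled, total = pooled_percent_correct(percent_correct, RESPONSES_PER_WORD)
print(f"{total} responses per word, about {pooled:.1f}% correct")
```

The total of 780 matches the count stated for the combined matrix, and the pooled estimate comes to roughly 75% correct.
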


CONFUSION MATRIX FOR CONDITION LPC 0% ERRORS, ALL SPEAKERS

CONFUSION MATRIX FOR CONDITION LPC 8% ERRORS, ALL SPEAKERS

RESPONSE 

CONFUSION MATRIX FOR CONDITION LPC 2% ERRORS, ALL SPEAKERS

RESPONSE 

CONFUSION MATRIX FOR CONDITION LPC 5% ERRORS, ALL SPEAKERS

RESPONSE 

CONFUSION MATRIX FOR CONDITION LPC 12% ERRORS, ALL SPEAKERS

RESPONSE 


CONFUSION MATRIX FOR CONDITION LPC 5, 8, 12% COMBINED, ALL SPEAKERS

RESPONSE 

CONFUSION MATRIX FOR CONDITION 800 BIT/S PROCESSOR, ALL SPEAKERS

RESPONSE 


